Naming issues found in v030.

New assembly workflow

1) taking v031 (cd-hit-est of 10 assemblies), then selecting only sequences >20,000 bp

cgigas_alpha_v031 subset _20k.fa
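For reference, the length cutoff in step 1 could be reproduced with something like the sketch below. This is not necessarily how the subset was actually made; the function names (read_fasta, filter_by_length, write_fasta) and the file names in the commented usage are placeholders chosen for illustration, and ">20,000 bp" is read here as strictly greater than 20,000.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len=20000):
    """Keep only records strictly longer than min_len bp."""
    return [(h, s) for h, s in records if len(s) > min_len]


def write_fasta(records, handle, width=60):
    """Write records to an open text handle, wrapping sequence lines at width."""
    for header, seq in records:
        handle.write(">" + header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Example usage (file names are hypothetical):
# with open("v031.fa") as src, open("v031_subset_20k.fa", "w") as dst:
#     write_fasta(filter_by_length(read_fasta(src)), dst)
```

The whole file is held in memory after filtering, which should be fine for an assembly of this size but would need streaming for much larger inputs.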

2) downloading BAC sequences
renaming
Galaxy56-[Tabular-to-FASTA_on_data_55].fasta
running through CD-HIT-EST

./cd-hit-est -i /Volumes/Bay4\ scratch/temp/Galaxy56-[Tabular-to-FASTA_on_data_55].fasta -o /Volumes/Bay4\ scratch/temp/CgigasBAC_cdhit -M 2500


total seq: 60
longest and shortest : 203422 and 84264
Total letters: 8610155
Sequences have been sorted

Approximated minimal memory consumption:
Sequence        : 8M
Buffer          : 1 X 2068M = 2068M
Table           : 1 X 16M = 16M
Miscellaneous   : 4M
Total           : 2098M

Table limit with the given memory limit:
Max number of representatives: 4194304
Max number of word counting entries: 50200207

comparing sequences from          0  to         60

       60  finished         53  clusters

Apprixmated maximum memory consumption: 2150M
writing new database
writing clustering information
program completed !

Total CPU time 148

CgigasBAC_cdhit




3) adding 12 select genes


Resulting assembly (steps 1-3 combined):


cgigas_alpha_v032.fa
http://aquacul4.fish.washington.edu/~steven/filefish/cgigas_alpha_v032.fa
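Since the three inputs are simply combined, and naming issues turned up in an earlier version, a quick duplicate-name check on the merged file may be worth doing. A minimal sketch, assuming the merge is a plain concatenation and that a sequence ID is the first whitespace-delimited word of each header line; merge_fasta and duplicate_ids are hypothetical helper names, not part of any tool used above.

```python
from collections import Counter


def merge_fasta(texts):
    """Concatenate FASTA texts, making sure each one ends with a newline."""
    parts = []
    for text in texts:
        if text and not text.endswith("\n"):
            text += "\n"
        parts.append(text)
    return "".join(parts)


def duplicate_ids(fasta_text):
    """Return a sorted list of sequence IDs that appear more than once."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            # The ID is taken as the first word of the header (an assumption).
            ids.append(fields[0] if fields else "")
    counts = Counter(ids)
    return sorted(name for name, n in counts.items() if n > 1)


# Example usage (input file names are hypothetical):
# texts = [open(p).read() for p in ("subset_20k.fa", "bac_cdhit.fa", "select_genes.fa")]
# merged = merge_fasta(texts)
# print(duplicate_ids(merged))
```

An empty list from duplicate_ids means every header ID in the merged file is unique under that assumption.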